feat(bench): flip sandbox-e (schema compression) to ACTIVE — first new ACTIVE since vllm-q4-llama8b by OpenCircuitDev · Pull Request #51 · OpenCircuitDev/opencircuitmodel

OpenCircuitDev · 2026-05-09T22:59:22Z

Summary

First sandbox to flip from INACTIVE to ACTIVE since the framework shipped. Sandbox E (schema compression) measures the input-token reduction from OCM's canonical MCP-tool compression recipe. The primary metric needs no model invocation: it is a purely deterministic measurement.

Local validation

Ran end-to-end on the actual workload via direct `python bench.py` (no Docker):

| Field | Value |
| --- | --- |
| primary_value | 70.12% median reduction |
| confirm threshold | ≥ 30% |
| verdict | CONFIRMED |
| reason | `primary 70.121 >= confirm_at_least 30.0` |
| tokenizer | cl100k_base (real tiktoken) |
| tokens median | 117.5 → 28.5 |
| n_tools | 30 |
| duration | 35 ms |
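The reduction implied by the two median token counts (117.5 → 28.5) works out to about 75.7%, not 70.12%. One plausible reading, and it is only an assumption here, is that `primary_value` is the median of per-tool percentage reductions rather than the reduction between the two medians. The sketch below illustrates the difference on made-up token counts and mirrors the `reason` string from the table. The function names and the `NOT_CONFIRMED` label are hypothetical and are not taken from `bench.py`.

```python
from statistics import median


def pct_reduction(before: float, after: float) -> float:
    """Percentage of tokens removed going from `before` to `after`."""
    return 100.0 * (before - after) / before


def verdict(primary: float, confirm_at_least: float) -> tuple[str, str]:
    """Compare the primary metric to the confirm threshold.

    The CONFIRMED reason string follows the format shown in the table;
    the NOT_CONFIRMED branch is a hypothetical placeholder.
    """
    if primary >= confirm_at_least:
        return "CONFIRMED", f"primary {primary:.3f} >= confirm_at_least {confirm_at_least}"
    return "NOT_CONFIRMED", f"primary {primary:.3f} < confirm_at_least {confirm_at_least}"


# Made-up (before, after) token counts for three tools, for illustration only.
toy_counts = [(100, 50), (200, 20), (400, 40)]

# Median of the per-tool reductions: median(50.0, 90.0, 90.0) = 90.0
median_of_reductions = median(pct_reduction(b, a) for b, a in toy_counts)

# Reduction between the medians: 200 -> 40 tokens = 80.0
reduction_of_medians = pct_reduction(
    median(b for b, _ in toy_counts),
    median(a for _, a in toy_counts),
)

print(median_of_reductions, reduction_of_medians)
print(verdict(70.121, 30.0))
```

The two aggregates generally differ, so a median-of-percentages metric is not expected to match the figure derived from the "tokens median" row.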

Spec impact

Spec v0.2 row 21 claimed a 30-60% reduction. The measured 70% exceeds the upper bound. Worth a follow-up note in spec hygiene: the recipe is more aggressive than originally claimed, so secondary accuracy validation becomes proportionally more important once the model-dependent harness lands.

Frame

The first new ACTIVE flip took ~700 lines of work (workload generator + 30-tool fixture + bench.py + compose + expected.json refit). It sets the recipe for the other 12 INACTIVE stubs as their `blocked_on` items resolve.

What this changes

- 2 ACTIVE sandboxes total (vllm-q4-llama8b + sandbox-e)
- 12 INACTIVE
- Workload registry grows: `bench/workloads/mcp-tool-defs-30.jsonl`
- Reusable generator script: `bench/workloads/_generate_mcp_tool_defs.py`
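The commit message describes the generator as deterministic (a re-run produces identical output). A minimal sketch of how such a generator can be made reproducible is shown below. It is not the contents of `_generate_mcp_tool_defs.py`: the function names, the seed, the even split of 30 tools over the six categories, and the placeholder tool fields are all assumptions for illustration. The real fixture is described as curated.

```python
import json
import random

# The six categories named in the commit message.
CATEGORIES = ["filesystem", "web", "code", "calendar", "email", "system"]


def generate_tool_defs(n_tools: int = 30, seed: int = 0) -> list[dict]:
    """Build placeholder MCP-style tool definitions (hypothetical content)."""
    rng = random.Random(seed)  # private seeded RNG, independent of global state
    tools = []
    for i in range(n_tools):
        category = CATEGORIES[i % len(CATEGORIES)]
        properties = {
            "target": {"type": "string", "description": f"Primary {category} target."}
        }
        # A seeded draw decides how many optional parameters each tool gets.
        for j in range(rng.randint(0, 3)):
            properties[f"option_{j}"] = {
                "type": "boolean",
                "description": f"Optional flag {j}.",
            }
        tools.append({
            "name": f"{category}_tool_{i:02d}",
            "description": f"Placeholder {category} tool number {i}.",
            "inputSchema": {
                "type": "object",
                "properties": properties,
                "required": ["target"],
            },
        })
    return tools


def to_jsonl(tools: list[dict]) -> str:
    """Serialize one tool per line; sort_keys keeps the byte output stable."""
    return "".join(json.dumps(tool, sort_keys=True) + "\n" for tool in tools)


first_run = to_jsonl(generate_tool_defs())
second_run = to_jsonl(generate_tool_defs())
print(first_run == second_run)
```

A fixed seed plus sorted keys means the JSONL file can be regenerated and diffed byte-for-byte against the committed fixture.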

🤖 Generated with Claude Code

Resolves all 3 blocked_on items the original INACTIVE stub listed, without needing the full MCP-multiturn-with-model harness:

- workload curated: bench/workloads/mcp-tool-defs-30.jsonl (30 representative MCP tool defs across 6 categories: filesystem, web, code, calendar, email, system)
- bench.py: applies canonical schema compression (strip descriptions, shorten param names, hide optional params), counts tokens before/after via cl100k_base (with deterministic char-div-4 fallback), reports median pct reduction
- docker-compose.yml: minimal python:3.11-slim container with tiktoken installed; reads workload from /workloads/, writes outputs.json
- expected.json: status flipped ACTIVE; secondary metric (tool-call accuracy delta) explicitly removed and tracked as a future paired model-dependent sandbox

Local end-to-end measurement (no Docker, direct python bench.py):

primary_value: 70.12% median reduction (cl100k_base tokenizer)
threshold: confirm_at_least=30%
verdict: CONFIRMED, well above the 30% bar

Also locked in this PR:

- .gitignore: bench/isolation/**/outputs.json (per-run artifact, not source of truth; bench/results/ holds the canonical summaries)
- generator script for the workload (deterministic: re-run produces identical output)

Net effect: bench framework now has 2 ACTIVE sandboxes (vllm-q4-llama8b + sandbox-e), 12 INACTIVE. dry-run-all reports cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
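The three compression steps and the token-counting fallback named in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `bench.py`: the `p0`, `p1`, ... renaming scheme, the function names, and the sample tool are hypothetical. The `name` / `description` / `inputSchema` layout follows the usual MCP tool definition shape.

```python
import json
from statistics import median

try:
    import tiktoken
    _ENCODER = tiktoken.get_encoding("cl100k_base")
except Exception:
    # tiktoken missing or its vocabulary unavailable: use the fallback below.
    _ENCODER = None


def count_tokens(text: str) -> int:
    """Count cl100k_base tokens, else fall back to a chars-divided-by-4 estimate."""
    if _ENCODER is not None:
        return len(_ENCODER.encode(text))
    return max(1, len(text) // 4)


def compress_tool(tool: dict) -> dict:
    """Apply the three steps: strip descriptions, shorten names, hide optionals."""
    schema = tool.get("inputSchema", {})
    properties = schema.get("properties", {})
    compressed = {}
    # Only required params survive (optional ones are hidden).
    for index, name in enumerate(schema.get("required", [])):
        spec = {k: v for k, v in properties.get(name, {}).items() if k != "description"}
        compressed[f"p{index}"] = spec  # hypothetical short-name scheme
    return {
        "name": tool["name"],  # the tool-level description is dropped
        "inputSchema": {
            "type": "object",
            "properties": compressed,
            "required": list(compressed),
        },
    }


def median_pct_reduction(tools: list[dict]) -> float:
    """Median over tools of the per-tool percentage token reduction."""
    reductions = []
    for tool in tools:
        before = count_tokens(json.dumps(tool))
        after = count_tokens(json.dumps(compress_tool(tool)))
        reductions.append(100.0 * (before - after) / before)
    return median(reductions)


# Hypothetical sample tool, not taken from the real workload file.
SAMPLE_TOOL = {
    "name": "read_file",
    "description": "Read a file from disk and return its contents as text.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Absolute path to the file."},
            "encoding": {"type": "string", "description": "Text encoding to use."},
        },
        "required": ["path"],
    },
}

print(compress_tool(SAMPLE_TOOL))
print(median_pct_reduction([SAMPLE_TOOL]))
```

A real implementation that shortens parameter names would also need to keep the mapping back to the original names so tool calls can be translated. That part is omitted from this sketch.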

OpenCircuitDev merged commit 8af15ab into main May 9, 2026
1 check passed

OpenCircuitDev deleted the feat/sandbox-e-schema-compression-active branch May 9, 2026 23:02

